3 research outputs found

    A standard tag set expounding traditional morphological features for Arabic language part-of-speech tagging

    Get PDF
    The SALMA Morphological Features Tag Set (SALMA, Sawalha Atwell Leeds Morphological Analysis tag set for Arabic) captures long-established traditional morphological features of grammar and Arabic, in a compact yet transparent notation. First, we introduce Part-of-Speech tagging and tag set standards for English and other European languages, and then survey Arabic Part-of-Speech taggers and corpora, and long-established Arabic traditions in analysis of morphology. A range of existing Arabic Part-of-Speech tag sets are illustrated and compared; and we review generic design criteria for corpus tag sets. For a morphologically-rich language like Arabic, the Part-of-Speech tag set should be defined in terms of morphological features characterizing word structure. We describe the SALMA Tag Set in detail, explaining and illustrating each feature and possible values. In our analysis, a tag consists of 22 characters; each position represents a feature and the letter at that location represents a value or attribute of the morphological feature; the dash ‘-’ represents a feature not relevant to a given word. The first character shows the main Parts of Speech, from: noun, verb, particle, punctuation, and Other (residual); these last two are an extension to the traditional three classes to handle modern texts. ‘Noun’ in Arabic subsumes what are traditionally referred to in English as ‘noun’ and ‘adjective’. The characters 2, 3, and 4 are used to represent subcategories; traditional Arabic grammar recognizes 34 subclasses of noun (letter 2), 3 subclasses of verb (letter 3), 21 subclasses of particle (letter 4). Others (residuals) and punctuation marks are represented in letters 5 and 6 respectively. The next letters represent traditional morphological features: gender (7), number (8), person (9), inflectional morphology (10) case or mood (11), case and mood marks (12), definiteness (13), voice (14), emphasized and non-emphasized (15), transitivity (16), rational (17), declension and conjugation (18). Finally there are four characters representing morphological information which is useful in Arabic text analysis, although not all linguists would count these as traditional features: unaugmented and augmented (19), number of root letters (20), verb root (21), types of nouns according to their final letters (22). The SALMA Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between and reuse of Arabic taggers and tagged corpora

    Morphological Tagging of the Qur’an

    No full text
    We present a computational system for morphological tagging of the Qur’an, for research and teaching purposes. The system facilitates a variety of queries on the Qur’anic text that make reference not only to the words but also to their linguistic attributes. The core of the system is a set of finite-state based rules which describe the morpho-phonological and morphosyntactic phenomena of the Qur’anic language. Using a finite-state toolbox we apply the rules to the Qur’anic text and obtain full morphological taggin

    Morphological Analysis of the Qur’an

    No full text
    We present a computational system for morphological analysis and an-notation of the Qur’an, for research and teaching purposes. The system facil-itates a variety of queries on the Qur’anic text that make reference not only to the words but also to their linguistic attributes. The core of the system is a set of finite-state based rules which describe the morpho-phonological and morpho-syntactic phenomena of the Qur’anic language. Using a finite-state toolbox we apply the rules to the Qur’anic text and obtain full morphological analysis of its words. The results of the analysis are stored in an efficient database and are accessed through a graphical user interface which facili-tates the presentation of complex queries. The system is currently being used for teaching and research purposes; we exemplify its usefulness for investi-gating several morphological, syntactic, semantic and stylistic aspects of the Qur’anic text. 2
    corecore